fixed masked flash attention #2589

l-k-11235 · 2024-05-17T16:47:29Z

This PR proposes a fix for the flash attention path in the multi-head attention module.
The flash attention block does not support a left-padding mask, so we apply the mask upstream, before the flash attention call.
See Dao-AILab/flash-attention#649

We apply the key_pad_mask to the values stored in the KV-cache at the first step only, in all scenarios (standard, flash, sdpa mechanisms).
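To make the description above concrete, here is a minimal sketch of masking the cached values with a padding mask. It is not the code from this PR: the function name `mask_cached_values`, the tensor layout `(batch, heads, seq_len, dim_per_head)`, and the convention that `key_pad_mask` is a boolean tensor of shape `(batch, 1, seq_len)` with `True` at padded positions are all assumptions made for illustration.

```python
import torch


def mask_cached_values(value, key_pad_mask):
    """Zero the value vectors at padded key positions (hypothetical helper).

    value:        (batch, heads, seq_len, dim_per_head)  -- assumed layout
    key_pad_mask: (batch, 1, seq_len) bool, True at padding -- assumed convention
    """
    if key_pad_mask is None:
        return value
    # (batch, 1, seq_len) -> (batch, 1, seq_len, 1); broadcasts over heads and dims
    return value.masked_fill(key_pad_mask.unsqueeze(-1), 0.0)


# Two left-padded prompts of length 4: the first has 2 pad tokens, the second none.
key_pad_mask = torch.tensor(
    [[[True, True, False, False]],
     [[False, False, False, False]]]
)
value = torch.ones(2, 3, 4, 5)  # batch=2, heads=3, seq_len=4, dim_per_head=5

# Applied once, at the first decoding step, before the values go into the cache.
masked = mask_cached_values(value, key_pad_mask)
```

Zeroing the values removes padded positions from the weighted sum, but it does not change the softmax weights themselves. This sketch is therefore not equivalent to masking the attention scores.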

vince62s · 2024-05-28T10:33:08Z

I think it would be great to explain in this PR when and how the key_pad_mask needs to be used, and to be clear about the different scenarios (standard, flash, sdpa mechanisms).

l-k-11235 added 3 commits May 17, 2024 18:45

fixed masked flash attention

2c4eded

some code cleaning

e9edb12

cleaned the scaled-dot attention path

506b355

further code cleaning

4fa9721

vince62s merged commit 9991c8d into OpenNMT:master Jun 27, 2024
2 checks passed
